Introduction

Background

This project will use data from the Eurovision Song Contest. The Eurovision Song Contest is a yearly competition in which countries each submit an artist and song to compete in a live televised event against songs from other countries. Each participating country awards points to its favourite songs, and the country whose song receives the most points wins and is allowed to host the contest the following year. The first contest took place in 1956 and included 14 songs from 7 countries. Since then, the contest has grown dramatically, with the 2022 competition, hosted in Turin, Italy, including songs from 40 countries. In recent years, the competition has gained popularity in countries outside Europe, with Australia joining the contest in 2015, the competition being televised in China and America, and America staging its own version of the competition in 2022 called the “American Song Contest”.

Over the 67 years that the competition has been held, the voting system that has been used to determine the winning song has changed drastically. Changes in the voting system have likely affected the number of points that are needed to win the competition.

Project Aim

This project will seek to visualise how changes in the voting system have affected the number of points accumulated by the winning song in each year of the competition. A secondary aim is to visualise which winning songs were the most successful in terms of the number of points they received.

Project organization

The /raw folder contains the raw data used for this project and /figs contains the figures produced from this project.

A codebook can be found in the project folder, which describes the variables and the functions used in the project.

Loading Packages

The renv package was used to store package versions used within the project.

Package versions are listed within the file /renv.lock.

#Load packages with renv
install.packages("renv")
library(renv)
renv::restore()

#Import packages
library(dplyr)
library(ggplot2)
library(here)
library(janitor)
library(plotly)
library(rvest)
library(tidyverse)
library(toOrdinal)

Data Origins

The raw data was scraped from the Eurovision website, https://eurovision.tv/history. The data for each year of the competition is located in nested links from this page. To access the full data, you need to select the year you would like to view. For example, selecting the 2022 competition takes you to the link https://eurovision.tv/event/turin-2022. From this link, you have to select whether to view data from the semi-finals or the final. Semi-finals were introduced in the 2004 competition as a way of allowing more countries to compete. This project only gathers data from the final shows, since these contain the data relating to the winning songs. The following link contains a data table showing the results from the 2022 competition: “https://eurovision.tv/event/turin-2022/grand-final”. The data is presented in the same format for each year of the competition. For this project, I needed to scrape the tables from each year and combine them into one large dataset.

Data scraping

Before scraping the data tables, I first had to scrape the URL links for each year of the competition. These links were scraped from the page https://eurovision.tv/history, using the SelectorGadget browser tool to identify the CSS selector for the year links. A problem I encountered at this stage was that the URL links varied in an unexpected way. For the competitions before 2004, the URL link finished with “/final”. Once semi-finals were introduced in the 2004 competition, the URL link containing the final data finished with “/grand-final”. This inconsistency meant that I was unable to scrape all the URL links using a single piece of code. To resolve this issue, I scraped the data using four sets of code. One set of code was used to scrape data from page 1, containing the year links for competitions between 2007 and 2022. Another set of code was used to scrape the year links from pages 3, 4 and 5, which contained year links for the competitions between 1956 and 1991. Two sets of code had to be used to scrape data from page 2, which contained competitions that had “/final” at the end of their URL (1992-2003) and competitions that had “/grand-final” at the end of their URL (2004-2006). Once I had collected the URL links from the Eurovision website, I scraped the data table from each URL link using the lapply() function. The tables scraped by each set of code were combined into their own dataframe in R, giving four dataframes in total. Below is an example of the code used to scrape data from pages 3, 4 and 5.

## scraping the data from multiple pages - page 3, 4 and 5

df56 <- data.frame() # start with an empty data frame to append each page's results to

# using a for loop to download the URLs from multiple pages
# the "page" query parameter is zero-based, so 2:4 covers pages 3, 4 and 5
for(page_result in 2:4) {
  link = paste0("https://eurovision.tv/history?page=", page_result)
  page = read_html(link)
  
  # getting nested links for each year of the contest
  year_links = page %>% 
    html_nodes(".views-field-field-event-year a") %>% 
    html_attr("href") %>% 
    paste0("https://eurovision.tv", ., "/final")
  
  # retrieving the tables and the years from each year link
  get_data <- lapply(year_links, function(i) {
    year_page = read_html(i)
    year_data = year_page %>% 
      html_nodes("table.cols-7") %>% 
      html_table() %>%  
      .[[1]]
    year <- str_extract(i, "\\d{4}")
    year_data$year <- year
    
    
    return(year_data)
  })
  # Adding all data to a new datatable called df56
  df56 = rbind(df56, data.table::rbindlist(get_data))
  
}

Combining the scraped data

Once I had scraped all the Eurovision data and added them to four dataframes, I combined them into a single dataframe using the function rbind(). This dataframe was saved as a CSV file.

# Combining the four datasets into a dataframe and saving it as a csv

df_esc <- rbind(df01, df04, df92, df56)

write.csv(df_esc, here("raw", "esc_raw_data.csv"))

A limitation of this data scraping method

To scrape the URLs from the second page on the Eurovision website, https://eurovision.tv/history?page=1, two sets of code had to be used. These differed only in the URLs that they scraped: one piece of code scraped URLs that ended in “/final”, while the other scraped URLs that ended in “/grand-final”. For each piece of code, SelectorGadget was used to highlight the location of the year links that I wanted to scrape on the page. This means that my code is dependent on the location of the year links on this page. Once the data from a new competition is added to the Eurovision website, the location of the year links on this page may shift, which will affect the data that is scraped from the website by my R code. For example, code that is designed to scrape URLs that end with “/final” may try to scrape the URL for the 2004 competition. This could result in R attempting to locate a URL that does not exist.
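
One possible way to reduce this dependence on page position, sketched here under assumptions rather than taken from the project code, is to choose the suffix from the year contained in each link instead of from where the link sits on the page. The helper name build_final_url and the second example link are hypothetical; the 2004 cut-off follows the URL pattern described above, and the sketch assumes every link contains a four-digit year.

```r
# Hypothetical helper: pick "/final" or "/grand-final" from the year in the link.
# Assumes each href contains a four-digit year (e.g. "/event/turin-2022").
build_final_url <- function(href, base = "https://eurovision.tv") {
  # extract the last run of four digits in the href as the contest year
  year <- as.integer(sub("^.*([0-9]{4}).*$", "\\1", href))
  # semi-finals were introduced in 2004, when the suffix changed
  suffix <- ifelse(year >= 2004, "/grand-final", "/final")
  paste0(base, href, suffix)
}

# "/event/example-1994" is a made-up link used only for illustration
urls <- build_final_url(c("/event/turin-2022", "/event/example-1994"))
print(urls)
```

Because the suffix depends only on the year, a helper along these lines would not need to be changed when a new competition is added to the listing.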

To minimise the impact that changes to the Eurovision website have on this R project, a combined dataset containing all the information from the competitions between 1956 and 2022 was saved as a CSV file. The rest of the R code runs from the data in this CSV file.

df <- read.csv(here("raw", "esc_raw_data.csv"))
head(df)
##   X R.O..Sort.descending     Half     Country  Participant             Song
## 1 1                    1        —     Czechia  We Are Domi       Lights Off
## 2 2                    2        —     Romania          WRS          Llámame
## 3 3                    3 1st half    Portugal         MARO Saudade, Saudade
## 4 4                    4        —     Finland   The Rasmus          Jezebel
## 5 5                    5 1st half Switzerland  Marius Bear      Boys Do Cry
## 6 6                    6        —      France Alvan & Ahez           Fulenn
##   Points Rank year
## 1     38 22nd 2022
## 2     65 18th 2022
## 3    207  9th 2022
## 4     38 21st 2022
## 5     78 17th 2022
## 6     17 24th 2022

Description of the data

The data contains 8 variables, which are:

* R/O Sort descending: the order in which the songs performed in the contest
* Half: whether the country performed in the first or second half of the contest
* Country: the name of the country
* Participant: the name of the artist
* Song: the name of the song
* Points: the total number of points each country received
* Rank: the rank which the song achieved in the contest
* year: the year of the competition

Data Preparation

Cleaning the column names

The column names contained capitals and special characters. To avoid running into problems when visualising the data in R, I used the clean_names() function from the janitor package to tidy these names.

#--------------TIDY DATA------------------

# cleaning the column names

df <- clean_names(df)

Removing the duplicate rows for 1969 competition

Since there were multiple winning songs in 1969, the 1969 competition has been repeated four times on the page “https://eurovision.tv/history?page=3”. This has resulted in the 1969 data being scraped multiple times. The below code was used to remove the duplicate rows for the 1969 competition.

At this point, I also wanted to select the columns that were relevant to the visualisation. The following columns were dropped because they contained data that would not be useful for the visualisation: half, r_o_sort_descending and x. The X column was added when the data was saved as a CSV and then read into R.

df1 <- df %>% 
  select(year, country, participant, song, points, rank) %>% # dropping the unnecessary columns
  unique() #remove duplicate rows 
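
As a side note on where the X column comes from, the small sketch below (toy data and temporary files, not the project files) shows that write.csv() stores row names by default and read.csv() then names that unnamed column X. Passing row.names = FALSE when saving is one way to avoid the extra column.

```r
# Toy data frame; the two rows reuse winners shown elsewhere in this report
toy <- data.frame(country = c("Ukraine", "Italy"), points = c(631, 524))

with_rownames <- tempfile(fileext = ".csv")
without_rownames <- tempfile(fileext = ".csv")

write.csv(toy, with_rownames)                        # row names are written as an unnamed first column
write.csv(toy, without_rownames, row.names = FALSE)  # row names are left out

cols_with <- names(read.csv(with_rownames))
cols_without <- names(read.csv(without_rownames))
print(cols_with)
print(cols_without)
```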

Replacing missing data in the dataset

There are 51 missing values in the dataframe, all in the points column. The Eurovision website is inconsistent in how it represents zero points in the data tables, with some years using the number zero and others using a blank space. These blank spaces were read into R as NAs. To avoid problems when presenting the data as a visualisation, I wanted to convert them back into zeros.

A large proportion of the missing data is from the 1956 competition, which was the only year in which the points allocation was not made public. Since the 1956 rows do not contain any usable points data, it is unnecessary to keep them in the dataframe.

# To count the NA's in each column: colSums(is.na(df))
# Dealing with missing data
df2 <- df1 %>% 
  replace_na(list(points = 0)) %>% # replace missing data with 0's
  filter(year != 1956) # filter out data from 1956
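
The same two steps can be illustrated on a toy data frame using base R only. This is a minimal sketch with made-up rows, not the project data, and it shows a before-and-after count of missing values as a quick check that the replacement worked.

```r
# Toy data: two 1956 rows without points and one blank (NA) score in 1957
toy <- data.frame(year = c(1956, 1956, 1957, 1957),
                  points = c(NA, NA, 31, NA))

na_before <- colSums(is.na(toy))      # count the NAs in each column

toy$points[is.na(toy$points)] <- 0    # treat blank scores as zero points
toy <- toy[toy$year != 1956, ]        # drop the 1956 rows

na_after <- colSums(is.na(toy))
print(na_before)
print(na_after)
```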

Additional information added to the dataframe

Voting system data was taken manually from Eurovisionworld, a fan website dedicated to the Eurovision Song Contest (https://eurovisionworld.com/esc/voting-systems-in-eurovision-history). This website was chosen because the information is not presented on the Eurovision website. Other information on Eurovisionworld, including the points totals in specific years of the competition, has been checked against the Eurovision website for accuracy. Data scraping was not used in this instance because the data was not presented in a format that would be easy to understand in a visualisation. Instead, the data was summarised and added to the data frame in a new column called voting_system. The seven voting systems were added to the column by creating seven voting system variables and using the mutate() function. The number at the end of each variable name represents the year in which the voting system was introduced.

# Adding changes to the voting system - data from eurovisionworld.com

voting_system16 <- "Two sets of points (12, 10, 8, 7, ....1) awarded to ten songs"
voting_system75 <- "One set of points (12, 10, 8, 7, ....1) awarded to ten songs"
voting_system57 <- "10 points split between one to ten songs"
voting_system62 <- "One set of points (3, 2, 1) awarded to three songs"
voting_system64 <- "One set of points (5, 3, 1) awarded to three songs"
voting_system63 <- "One set of points (5, 4, 3, 2, 1) awarded to five songs"
voting_system71 <- "2-10 points awarded to each song"

# adding a new column for voting system to the dataframe using mutate()

df3 <- df2 %>% 
  mutate(voting_system = case_when(year > 2015 ~ voting_system16,
                                   year > 1974 & year < 2016 ~ voting_system75, 
                                   year %in% c(1957:1961, 1967:1970, 1974) ~ voting_system57,
                                   year == 1962 ~ voting_system62, 
                                   year == 1963 ~ voting_system63, 
                                   year %in% 1971:1973 ~ voting_system71, 
                                   year %in% 1964:1966 ~ voting_system64))

# ordering the factor levels by the year each voting system was introduced
df3$voting_system <- factor(df3$voting_system, 
                            levels = c(voting_system57, 
                                       voting_system62, 
                                       voting_system63,
                                       voting_system64, 
                                       voting_system71, 
                                       voting_system75,
                                       voting_system16))

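An alternative way to express the same year-to-system mapping, shown here only as a sketch, is a lookup based on the years in which the system changed. The function name voting_era and the short labels are hypothetical stand-ins for the full descriptions above; the change years are the ones used in the case_when() call.

```r
# Years in which the voting system changed (taken from the mapping above)
change_years <- c(1957, 1962, 1963, 1964, 1967, 1971, 1974, 1975, 2016)

# System in force from each change year onwards; short labels stand in for
# the full descriptions (e.g. "57" for voting_system57)
system_from <- c("57", "62", "63", "64", "57", "71", "57", "75", "16")

# Hypothetical helper: return the label of the system used in a given year
voting_era <- function(year) {
  idx <- findInterval(year, change_years)  # 0 for years before 1957
  idx[idx == 0] <- NA                      # no system recorded for those years
  system_from[idx]
}

eras <- voting_era(c(1956, 1960, 1962, 1969, 1974, 1975, 2015, 2016))
print(eras)
```

A lookup like this keeps the change years in one place, which may be easier to update if a future voting system is added.
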
Creating a dataframe with only the winning songs for each year

I decided to present the changes in the points allocation process by visualising the number of points accumulated by the winning songs. I felt that this would be a more intuitive and interesting method of visualisation than using the total number of points or the average number of points allocated in each year. As a result, I filtered the dataset to keep only the data from the winning songs.

# Filter data that was ranked 1st

df4 <- df3 %>% 
  filter(rank == "1st")

head(df4)
##   year     country      participant            song points rank
## 1 2022     Ukraine Kalush Orchestra        Stefania    631  1st
## 2 2021       Italy         Måneskin   Zitti E Buoni    524  1st
## 3 2019 Netherlands  Duncan Laurence          Arcade    498  1st
## 4 2018      Israel            Netta             TOY    529  1st
## 5 2017    Portugal  Salvador Sobral Amar Pelos Dois    758  1st
## 6 2016     Ukraine           Jamala            1944    534  1st
##                                                   voting_system
## 1 Two sets of points (12, 10, 8, 7, ....1) awarded to ten songs
## 2 Two sets of points (12, 10, 8, 7, ....1) awarded to ten songs
## 3 Two sets of points (12, 10, 8, 7, ....1) awarded to ten songs
## 4 Two sets of points (12, 10, 8, 7, ....1) awarded to ten songs
## 5 Two sets of points (12, 10, 8, 7, ....1) awarded to ten songs
## 6 Two sets of points (12, 10, 8, 7, ....1) awarded to ten songs

The new dataframe contains four rows for the 1969 competition, because there were four winners in 1969. I wanted to combine the information for all the winning songs, participants and countries into one row, as this makes the dataset more concise and supports the visualisation of this data in the form of labels on each datapoint. The points, rank and year columns did not need to change because they were identical for the four songs.

df4$country <- ifelse(df4$year == 1969,
                      paste(unique(df4$country[df4$year == 1969]), collapse = ", "),
                      df4$country)

df4$song <- ifelse(df4$year == 1969,
                       paste(unique(df4$song[df4$year == 1969]), collapse = ", "),
                       df4$song)

df4$participant <- ifelse(df4$year == 1969,
                    paste(unique(df4$participant[df4$year == 1969]), collapse = ", "),
                    df4$participant)

df5 <- df4 %>% 
  select(year, voting_system, country, participant, song, points) %>% 
  unique()
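
The effect of this collapsing step can be seen on a toy data frame. The sketch below uses base R's aggregate() as a compact alternative to the ifelse() calls above; the rows are illustrative rather than taken from the scraped data, and only the country column is collapsed to keep the example short.

```r
# Toy winners table: one winner in 1970 and four tied winners in 1969
toy <- data.frame(
  year = c(1970, 1969, 1969, 1969, 1969),
  country = c("Ireland", "Spain", "United Kingdom", "Netherlands", "France"),
  points = c(32, 18, 18, 18, 18)
)

# One row per year and points total, with the tied countries pasted together
collapsed <- aggregate(
  toy["country"],
  by = toy[c("year", "points")],
  FUN = function(x) paste(unique(x), collapse = ", ")
)
print(collapsed)
```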

Visualisation

Labels for the graph

#--------------PLOT DATA------------------
# labels that will be added to the plot

plottitle <- "Points allocation in the Eurovision Song Contest"
xlab <- "Year"
ylab <- "Points"
grouplab <- "Voting system for each country"
caplab <- "Source: eurovision.tv/history"
sublab <- "Points total of the winning songs 1957-2022"

Creating visualisation 1

A scatter plot was the most appropriate way to visualise how the data changed over time. Year was used as the x axis and points as the y axis, with voting system as the grouping variable, mapped to colour. The x axis breaks were manually set at intervals of 5 years to make the data easier to read. The size of each point was increased so that the graph would look less empty and more appealing, but not so far as to cause too much overlap between the points.

The plot was converted into an interactive Plotly graph using the function ggplotly(). This gave the graph hover capabilities, so that each point displays data about the winning song, such as the winning country, song and participant, when hovered over. Since I collapsed the four rows for 1969, the label for the 1969 point contains data for all four winning songs. Another benefit of using a Plotly graph is that you can double click on a voting system in the legend and the plot will isolate the points from this voting system. Subtitles and captions are difficult to display on a ggplot graph once it has been converted to Plotly, so these were added after the conversion using the layout() function. The legend is displayed horizontally below the graph. This was appropriate because the legend is long and would have reduced the space available for the graph if presented vertically alongside it.

### Creating visualisation 1

# constructing the plot using ggplot
# Year as the x axis and points on the y axis
# Voting system used as colour and text added for labels

p1 <- ggplot(df5, mapping = aes(x = year, 
                                y = points, 
                                col = voting_system,
                                text = paste("<br>Year:",year,
                                             "<br>Winning country:",country,
                                             "<br>Artist:",participant,
                                             "<br>Song:", song, 
                                             "<br>Points:",points)))

# Adding additional elements to the plot e.g., points, scale, title and legends

p1 <- p1 + geom_point(size = 2.5) +
  scale_x_continuous(limits = c(1955, 2022), 
                     breaks = seq(1955, 2022, 5)) +
  labs(title = plottitle, 
       x = xlab, 
       y = ylab,
       colour = grouplab)

# Converting the ggplot into an interactive plotly graph with hover labels

p1 <- ggplotly(p1, tooltip = c("text"), width = 800,
               height = 600) %>% 
  layout(margin = list(l = 30, r = 30, b=60, t = 60, pad = 4), 
         title = list(text = paste0(plottitle,
                                    '<br>',
                                    '<sup>',
                                    sublab,'</sup>')),
         annotations = list(x = 1, y = -0.2,
                            text = caplab,
                            showarrow = FALSE,
                            xref = "paper", 
                            yref = "paper", 
                            xanchor="right", yanchor="auto", 
                            xshift = 0, yshift=0, 
                            font=list(size=13)))
p1 %>%  layout(legend = list(orientation = 'h', 
                             xanchor = "center",
                             x = 0.5, 
                             y =-0.2)) %>% 
  style(legendgroup = NULL)

Visualisation 1 summary

Visualisation 1 shows that the points totals of the winning songs in the Eurovision Song Contest have increased drastically over the years. Changes in the points totals have coincided with changes in the voting system, particularly in the early years of the competition. In the early years, when each country awarded only a few points to a small number of songs, the points total of the winning song was low, falling to 18 points in 1969. Changing the voting system in 1971 led to a rapid increase in the number of points given to the winning song, likely because each country had to award points to all the songs. Since 1975, each country has awarded a maximum of twelve points to its favourite song. As a result, the points totals of the winning songs between 1975 and 2003 remained fairly consistent, with a slight increase likely due to the increasing number of participating countries. From 2004, there is a further increase in the points totals, which can be explained by the large increase in the number of participating countries following the introduction of the semi-finals. Finally, the choice in 2016 for each country to award two sets of points, separating the public and jury vote, led to a dramatic increase in the points totals of the winning songs. This graph indicates that the most successful winning entry in the Eurovision Song Contest, by points total, is Portugal in 2017 with 758 points.

This visualisation fulfills the aim of this project, to show how changes in the voting system have affected the points totals of the winning songs and to display which winning songs have been the most successful according to points accumulated. However, a problem with this graph is that it does not control for the effect of the number of participating countries. The number of participating countries has increased over the years, which has likely affected the total number of points that each country can obtain. To accurately visualise how voting system has affected the points accumulation of each winning song, I controlled for the number of participating countries.

Adapting the plot to control for the number of countries

To control for the number of participating countries, I calculated the share of points that the winning songs received in each year. This was calculated by dividing the number of points allocated to the winning song by the total number of points awarded in that year, and multiplying by 100 to give a percentage.
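
As a small worked example of the share calculation (each song's points divided by the total points awarded in its year, times 100), the sketch below uses made-up points and base R's ave() to compute the yearly totals. The years and values are illustrative only.

```r
# Toy data: two contest years with three songs each
toy <- data.frame(year = c(2001, 2001, 2001, 2002, 2002, 2002),
                  points = c(50, 30, 20, 10, 10, 20))

# Total points awarded within each year, repeated on every row of that year
year_total <- ave(toy$points, toy$year, FUN = sum)

# Share of the year's points received by each song, as a percentage
toy$share_points <- round(toy$points / year_total * 100, 2)
print(toy)
```

In this toy data the 2001 winner holds 50% of the points awarded that year, even though the two years hand out different totals, which is the property that makes the share comparable across years.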

# Create a new column to show share of points 
df6 <- df3 %>% 
  group_by(year) %>% 
  mutate(share_points = (points/(sum(points))*100))


# Change the share_point column to two decimal places 

df6$share_points <- round(df6$share_points, 2)

df7 <- df6 %>% 
  filter(rank == "1st") # filter data to only include winning songs

df7$country <- ifelse(df7$year == 1969,
                      paste(unique(df7$country[df7$year == 1969]), collapse = ", "),
                      df7$country)

df7$song <- ifelse(df7$year == 1969,
                   paste(unique(df7$song[df7$year == 1969]), collapse = ", "),
                   df7$song)

df7$participant <- ifelse(df7$year == 1969,
                          paste(unique(df7$participant[df7$year == 1969]), collapse = ", "),
                          df7$participant)

df8 <- df7 %>% 
  unique()

Creating visualisation 2

sublab2 <- "Share of points of the winning songs 1957-2022"
ylab2 <- "Share of points (%)"
# constructing the plot using ggplot
# Year as the x axis and share of points on the Y axis
# Voting system used as colour and text added for labels

p2 <- ggplot(df8, mapping = aes(x = year, 
                                y = share_points, 
                                col = voting_system, 
                                text = paste("<br>Year:",year,
                                             "<br>Winning country:",country,
                                             "<br>Artist:",participant,
                                             "<br>Song:", song,
                                             "<br>Share of Points:",share_points, "%")))

# Adding additional elements to the plot e.g., points, scale, title and legends

p2 <- p2 + geom_point(size = 2.5) +
  scale_x_continuous(limits = c(1955, 2022), 
                     breaks = seq(1955, 2022, 5)) +
  labs(title = plottitle, 
       x = xlab, 
       y = ylab2, 
       colour = grouplab)

# Converting the ggplot into an interactive plotly graph with hover labels

p2 <- ggplotly(p2, tooltip = c("text"), width = 700,
               height = 600) %>% 
  layout(margin = list(l = 30, r = 30, b=60, t = 60, pad = 4), 
         title = list(text = paste0(plottitle,
                                    '<br>',
                                    '<sup>',
                                    sublab2,'</sup>')),
         annotations = list(x = 1, y = -0.2,
                            text = caplab,
                            showarrow = FALSE,
                            xref = "paper", 
                            yref = "paper", 
                            xanchor="right", yanchor="auto", 
                            xshift = 0, yshift=0, 
                            font=list(size=13)))
p2 %>%  layout(legend = list(orientation = 'h', 
                             xanchor = "center",
                             x = 0.5, 
                             y =-0.2)) %>% 
  style(legendgroup = NULL)

Visualisation 2 summary

Unlike Visualisation 1, Visualisation 2 is able to successfully depict the impact of the voting system changes on the points received by the winning songs, while controlling for the number of countries in each year. When comparing the two graphs, we can see that while the points totals of winning songs have increased over the years, the share of points has been relatively consistent since 1975. In the early years of the contest (1957-1974), the frequent changes in the voting system led to large variability in the share of points achieved by the winning song. In 1964, Italy received the highest share of points of any song in any year of the competition, with 34.03%. This was likely due to the voting system, where only three songs received points from each country and the highest ranked song for each country received 2 points more than the second-placed song and 4 points more than the third-placed song. In contrast, the lowest shares of points were achieved by the winning songs between 1971 and 1973, when every song was awarded 2-10 points by each country. These large differences do not appear in Visualisation 1, where the points totals are displayed. Since 1975, where a similar voting system has been used in each year, each winning song has received a share of points between 10% and 16%, aside from Azerbaijan in 2011. This outlier is likely related to the large number of countries taking part in 2011 (43 countries), one of the largest fields in the contest's history.

Project Summary

This project involved scraping data directly from the Eurovision website, cleaning the data, and making an interactive visualisation using ggplot and plotly.

Future Direction

I decided not to include data about whether the points came from a jury, the public or a 50/50 mix of the two. I did not feel this would have a drastic effect on the points totals of the winning songs. Another reason for this decision was that I wanted to avoid making the visualisation overly complex and difficult to interpret. However, it would be interesting for a future plot to compare the points totals of songs in competitions where the points came from the juries, the public or a mix of the two.

Another interesting way to investigate the Eurovision data would be to visualise the effect of running order position on the points totals of each song in the contest. There is an assumption among Eurovision fans that the 2nd position in either the semi-finals or the grand-final is the worst position to perform from. It would be interesting to investigate whether this is true.

Finally, this Eurovision data could also be used to determine whether certain countries are more likely to give points to one another. Eurovision is seen by many to be politically motivated, where certain countries are more likely to do poorly than others. It has become a running joke in the Eurovision community that Cyprus and Greece will always give each other maximum points. It would be interesting to find a way to visualise this in the future.